ABSTRACT

A software package realizes real-time video processing on
a commercially available programmable device¹. The
software implements a motion estimator and a picture rate
converter to provide judder-free display of movie material
broadcast in 2-3 pull-down mode. A new object-based
true-motion estimation algorithm efficiently uses the VLIW
core of the processor. It permits quasi-simultaneous
motion estimation/segmentation for a fixed maximum
number of objects.



1 INTRODUCTION 

Picture sequences are generated at various picture rates:
movie material at 24, 25, and 30 Hz, and video usually at
50 and 60 Hz. Simple picture-rate converters repeat
pictures until the next one arrives, which results in blur
and/or judder when motion occurs. Knowing the motion
allows these effects to be eliminated.

However, Motion Estimation (ME) has only recently 
reached the quality and price levels needed for consumer 
equipment [1]. This paper shows new progress, introducing 
ME and MC (motion compensation) for eliminating judder 
from 60 Hz TV movies using a software package that runs 
real time on a programmable device [2]. De-interlacing and 
movie detection on the VLIW core of the processor has 
already been reported [3]. The new concept is an extension 
of that work for high-end TVs and broadcast PCs. 

The software package implements motion-compensated 
picture interpolation with order statistical filtering [4] to 
guarantee robustness in the event of vector errors. This
concept has proven useful before [1]. The ME part of the
design, however, is completely new. Rather than 
estimating vectors for blocks, we applied an object-based 
ME method. The software detects video material that 
originates from film, and automatically adapts to it. The 
software allows conversion to progressive and interlaced 
TV formats, and to VGA for PCs. 



1. The Philips TriMedia processor is commercially available as
the TM1000.



Section 2 of this paper introduces the innovative object- 
based motion estimation algorithm, Section 3 discusses 
film-mode recognition, and Section 4 the up-conversion 
algorithm. Section 5 covers the software implementation, 
performance and use of resources. Our conclusions are 
given in Section 6. 



2 MOTION ESTIMATION 

Motion vectors are used in a wide range of applications 
such as coding, noise reduction, and scan rate conversion. 
Some of these applications, particularly frame rate 
conversion, require the true-motion of objects to be 
estimated [5, 6]. Other applications, for example 
interlaced-to-sequential scan conversion, demand a high 
accuracy of the motion vectors to achieve a low amplitude 
of remaining alias [7, 8]. There is a final category of 
applications, such as consumer ones, where the cost of the 
motion estimator is of crucial importance [9, 10]. 

Several algorithms have been proposed to achieve true-
motion estimation [5, 6, 10-13], and algorithms have also
been proposed to realize motion estimation at a low
complexity [9-11, 14-16]. In addition to the pel-recursive
algorithms that usually allow sub-pixel accuracy [17, 18], a
number of block-matching algorithms have been reported
that yield highly accurate motion vectors [5, 19, 20]. Some
years ago, a recursive type of block-matcher was proposed
that combines the true-motion estimation required for
frame rate conversion with the low complexity constraint
needed for consumer applications [11]. A commercial
implementation has improved the motion portrayal of film
material shown on television [1].

The new motion estimator described in this paper is 
suitable for scan-rate conversion, with even lower 
computational complexity than before [11]. The aim was to 
make the reduction so significant that the algorithm would 
run in real time on the VLIW core of the processor. 

A first reduction is achieved by designing an object-based
motion estimator instead of a block-based motion
estimator. We expect fewer objects than blocks in realistic
images, which implies that fewer motion parameters have
to be estimated. We have assumed that the motion of each
object can be described with a parametric model.

A second reduction is achieved by reducing the resolution 
of the images on which the motion is estimated (sub- 
sampling). 

The new object-based motion estimator can distinguish a
maximum of three different objects. Here, an "object" means
one or more image sections that can be described using the
same motion model; it does not necessarily correspond to a
single physical object in the scene.

In the following subsections, we shall describe the motion 
model, the estimation of the model parameters, the cost 
function, and the segmentation of the image. 

2.1 Motion model 

To keep complexity low, the motion of an object l is
described by a simple translational model¹,

$$\vec{D}(\vec{x}, l, n) = \begin{pmatrix} s_x(l, n) \\ s_y(l, n) \end{pmatrix} \tag{1}$$

using $\vec{D}(\vec{x}, l, n)$ for the displacement vector of object l at
location $\vec{x} = (x, y)^T$ in the image with index n.

2.2 Parameter estimation 

Given a motion model, the next problem is to estimate its
parameters to yield a useful description for an object in the
image. As it is vital for almost every sequence to deal
correctly with stationary image sections, the algorithm
starts with an 'object 0', for which motion is described by
the zero vector (no estimation effort is required for this).
The parameter vectors of additional objects l, l > 0, are then
estimated separately and in parallel, as shown in Figure 1,
by their respective parameter estimators ($PE_l$). Each $PE_l$
has a basic principle very similar to that of the 3D recursive
search block matcher of [11]. A previously estimated
parameter vector is updated, after which the best parameter
vector is selected according to a cost function.

Considering the two-parameter model of eq. (1), the
parameters of object l, l > 0, are regarded as a parameter
vector $\vec{P}_l$:

$$\vec{P}_l(n) = \begin{pmatrix} s_x(l, n) \\ s_y(l, n) \end{pmatrix} \tag{2}$$

and we define our task as being to select $\vec{P}_l(n)$ from a
number of candidate parameter vectors $\vec{CP}_l(n)$ as the one
that has the minimum value of a cost function, to which we
shall return later.



1. More complex parametric motion models have been proposed
[21] and can indeed be used in combination with the proposed
algorithm, but will be disregarded here.



Fig. 1 Block diagram of multiple parameter estimators and
segmentation. (Diagram labels: current image data, previous
image data, P_0(n) = (0, 0), PE_l, X_l, W_l(x), SEGMENTATION,
M(B, n).)



The candidates are generated in a very similar manner to
the strategy exploited in [11]: we take a prediction vector,
add at least one update vector, and select the best candidate
parameter vector according to cost function eq. (5).
Candidate parameter set $CPS_l(n)$ contains three candidates
$\vec{CP}_l(n)$ according to:

$$CPS_l(n) = \{ \vec{CP}_l(n) \mid \vec{CP}_l(n) = \vec{P}_l(n-1) + m\,\vec{UP}_l(n),\; \vec{UP}_l(n) \in UPS_l(n),\; m = -1, 0, 1 \} \tag{3}$$

with update parameter $\vec{UP}_l(n)$ selected from update
parameter set $UPS_l(n)$:

(i = 1, 2, 4, 8, 16)    (4)
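
The candidate generation of eq. (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the contents of the update set in eq. (4) are not fully legible here, so the sketch assumes axis-aligned updates (i, 0) and (0, i) with the magnitudes i = 1, 2, 4, 8, 16; the function names are hypothetical, and how one update is chosen per picture is left open.

```python
# Sketch of eq. (3): candidates = prediction + m * update, m in {-1, 0, 1}.
# ASSUMPTION: updates are axis-aligned, (i, 0) or (0, i). Only the
# magnitudes i = 1, 2, 4, 8, 16 are taken from eq. (4).

UPDATE_MAGNITUDES = (1, 2, 4, 8, 16)


def update_parameter_set():
    """Hypothetical UPS: horizontal and vertical updates of each magnitude."""
    horizontal = [(i, 0) for i in UPDATE_MAGNITUDES]
    vertical = [(0, i) for i in UPDATE_MAGNITUDES]
    return horizontal + vertical


def candidate_parameter_set(prediction, update):
    """Return the three candidates P(n-1) + m * UP(n) for m = -1, 0, 1."""
    px, py = prediction
    ux, uy = update
    return [(px + m * ux, py + m * uy) for m in (-1, 0, 1)]


def select_best(candidates, cost):
    """Select the candidate parameter vector with the minimum cost."""
    return min(candidates, key=cost)
```

With a prediction of (3, -1) and an update of (4, 0), the candidates are (-1, -1), (3, -1) and (7, -1); the cost function then decides which of the three becomes the new parameter vector.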



2.3 The cost function 

Given the motion model and some candidate parameter 
sets, we need to select the best candidate for a given object 
according to a cost function result. The cost function can be 
a sum of absolute differences between motion compensated 
pixels from neighbouring images, with vectors generated 
by the (candidate) motion model. However, we need to 
know the area to which the motion model is to be assigned.

We now assemble the recent history set RH(n) as:

$$RH(n) = \{ \max(n), \max(n-1), \max(n-2), \max(n-3), \max(n-4), \max(n-5), \max(n-6) \} \tag{9}$$

which with adaptive thresholding is converted into a binary
movie detection set MD(n). For 2-2 pull-down, this will give
something like:

$$MD(n) = \{0, 1, 0, 1, 0, 1, 0\} \tag{10}$$

for 2-3 pull-down something like:

$$MD(n) = \{0, 1, 0, 0, 1, 0, 1\} \tag{11}$$

and for video something like:

$$MD(n) = \{1, 1, 1, 1, 1, 1, 1\} \tag{12}$$



Comparing the actual set with a limited number of known 
patterns yields information on movie type and phase. For 
scene cuts, the detector output is unreliable, and motion 
compensation should be switched off. 
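
The pattern comparison can be illustrated with a small Python sketch. It is not the detector of the software package: the fixed threshold stands in for the adaptive thresholding, which is not specified here, and the names `binarize`, `classify` and `PATTERNS` are hypothetical. The known patterns are taken as all phases of the periodic sequences shown in eqs. (10) to (12).

```python
def binarize(recent_history, threshold):
    """Convert RH(n) into MD(n): 1 where the motion exceeds the threshold.

    ASSUMPTION: a fixed threshold replaces the adaptive thresholding.
    """
    return [1 if abs(v) > threshold else 0 for v in recent_history]


def _phases(base, length):
    """All cyclic phases of a periodic pattern, cut to `length` entries."""
    period = len(base)
    return {
        tuple(base[(k + p) % period] for k in range(length))
        for p in range(period)
    }


# Known patterns for a history of seven pictures.
PATTERNS = {
    "video": _phases([1], 7),
    "2-2 pull-down": _phases([0, 1], 7),
    "2-3 pull-down": _phases([0, 1, 0, 0, 1], 7),
}


def classify(md):
    """Return the mode whose pattern matches MD(n), or 'unknown'."""
    key = tuple(md)
    for mode, patterns in PATTERNS.items():
        if key in patterns:
            return mode
    return "unknown"
```

A set that matches none of the patterns, as may happen around a scene cut or in a still picture, is reported as unknown, in which case a cautious implementation would switch motion compensation off.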

4 THE UP-CONVERSION ALGORITHM 

The software supports conversion to progressive TV 
formats and to (progressive) VGA for PCs. The input 
material therefore needs de-interlacing. All supported input 
picture rates are also converted to the correct output picture 
rate, so temporal interpolation is also required. 

Material originating from film needs temporal
interpolation. The de-interlacing in this case is a field
merging, which requires no operations other than correct
memory addressing.

Material originating from video cameras needs no temporal 
interpolation as the output picture rate matches the input 
picture rate. However, field merging is not suitable as a de- 
interlacing method, since with motion both fields show 
objects at different positions. Therefore, a vertical- 
temporal median filter is implemented for de-interlacing. 

Depending on the film mode, the software operates either 
in a temporal interpolation mode for film or in a de- 
interlacing mode for video. 

Note that the resources for motion estimation are necessary
all the time, as without motion vectors it is impossible to
reliably discriminate between video and film.

4.1 Temporal interpolation 

An MC temporal interpolation algorithm with order
statistical filtering guarantees robustness in the event of
vector errors. This concept has proved useful before, in [1].

A straightforward motion compensated average would
yield an intermediate picture according to:

$$F_i(\vec{x}, n) = \frac{1}{2} \{ F(\vec{x} - \alpha \vec{D}(\vec{x}, n), n-1) + F(\vec{x} + (1-\alpha) \vec{D}(\vec{x}, n), n) \} \tag{13}$$



where α determines the temporal position of the
interpolated image. We implement an order statistical filter
to make the up-conversion more robust when coping with
erroneous motion vectors. A very basic version was
described in [4]. This version uses:

$$F_i(\vec{x}, n) = \mathrm{med}\{ F(\vec{x} - \alpha \vec{D}(\vec{x}, n), n-1),\; Av,\; F(\vec{x} + (1-\alpha) \vec{D}(\vec{x}, n), n) \} \tag{14}$$

with

$$Av = \frac{1}{2} \{ F(\vec{x}, n) + F(\vec{x}, n-1) \} \tag{15}$$

and:

$$\mathrm{med}(a, b, c) = \begin{cases} a, & (b \le a \le c \;\vee\; c \le a \le b) \\ b, & (a \le b \le c \;\vee\; c \le b \le a) \\ c, & (\text{otherwise}) \end{cases} \tag{16}$$
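
As an illustration of eqs. (13) to (16), the following Python sketch interpolates one pixel on a single image line. It is a simplified, hypothetical setting: pictures are 1-D lists, the displacement is a scalar, fractional positions are rounded to the nearest sample, and positions outside the line are clamped to the border.

```python
def med(a, b, c):
    """Median of three values, eq. (16)."""
    return sorted((a, b, c))[1]


def fetch(line, pos):
    """Read a sample at a (possibly fractional) position, clamped to the line.

    ASSUMPTION: nearest-sample rounding instead of sub-pixel interpolation.
    """
    index = min(max(int(round(pos)), 0), len(line) - 1)
    return line[index]


def mc_average(prev, curr, x, d, alpha):
    """Motion compensated average, eq. (13)."""
    return 0.5 * (fetch(prev, x - alpha * d) + fetch(curr, x + (1 - alpha) * d))


def mc_median(prev, curr, x, d, alpha):
    """Order statistical interpolation, eqs. (14) and (15)."""
    av = 0.5 * (curr[x] + prev[x])  # non-compensated average
    return med(fetch(prev, x - alpha * d), av, fetch(curr, x + (1 - alpha) * d))
```

With a correct vector both compensated samples agree and the median returns their value. With a wrong vector the two compensated samples tend to disagree, and the median then falls back on the non-compensated average, which shows up as blur rather than as a misplaced object.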



4.2 De-interlacing 

A vertical-temporal median filter [23] is used for de-
interlacing video camera signals. As the temporal
interpolation is required for movie material only, no
additional resources are claimed for video. The processing
power required for robust up-conversion implementing
equation (14) for movie material is consumed by
interpolating intermediate lines for the current field n using
equation (17) in the case of video material.

$$F_i(\vec{x}, n) = \mathrm{med}\left\{ F\left(\vec{x} - \begin{pmatrix} 0 \\ 1 \end{pmatrix}, n\right),\; F(\vec{x}, n-1),\; F\left(\vec{x} + \begin{pmatrix} 0 \\ 1 \end{pmatrix}, n\right) \right\} \tag{17}$$



5 IMPLEMENTATION ON A PROGRAMMABLE 
DEVICE 

Figure 4 shows the block diagram of the processor on
which the application runs. Figure 5 shows the sections of
the algorithm that are running on the VLIW core of the
processor. Table 1 shows some of the options that are
implemented in the software.

Fig. 4 Block diagram of the processor.

Fig. 5 Block diagram of the processing.

Table 1 Conversion characteristics of the software.

Picture formats:

Input           Picture rate (pic/sec)       Output
525 lines/2:1   24, 30 (movie); 60 (video)   60 Hz (direct): 720x480/2:1 (CCIR601), 720x480/1:1
                                             60 Hz (PCI): 640x480/1:1 (VGA)
625 lines/2:1   25 (movie); 50 (video)       50 Hz (direct): 720x576/2:1 (CCIR601), 720x576/1:1

Velocity range (pixels/picture-period):

Horizontal:   +/- 32
Vertical:     +/- 16

De-interlacing algorithm:

Movie:   Field-merging
Video:   VT-median filtering

Motion compensation algorithm:

Movie:   Order statistical filtering
Video:   None (not required)

The next two subsections discuss the performance of the 
new object-based algorithm (which is somewhat worse 
than our reference block-based algorithm) and the resource 
usage (which is better). 

5.1 Performance 

The innovation in this system is the object-based motion
estimator. It scores somewhat worse on the Modified Mean

Fig. 2 Example of selection of points of interest (panel label:
previous segmentation (n-1)).



These two issues, segmentation and motion estimation, are
inter-dependent. To estimate the motion in an object
correctly, the area of the object should be known and vice
versa.

We circumvented the chicken-and-egg problem by
introducing a hierarchy in the parameter estimators and by
using segmentation masks calculated for the previous
image. Each parameter estimator $PE_l$ calculates its cost
function on the set of locations defined in set $X_l$ (called
'points of interest'). Different locations within set $X_l$ can
be given a different weighting factor $W_l(\vec{x})$, dependent
on location $\vec{x}$.
The hierarchy in the estimators is achieved by:
• Selecting a set of points of interest $\vec{x}$ in $X_l$, as found in
the previous image, by excluding locations sufficiently
covered by higher ranked objects in the previous image.
Each estimator, apart from the highest in the hierarchy
(the zero estimator), minimizes a cost function
calculated on locations where all higher level
estimators were unsuccessful in the previous image.
The set of points of interest $X_l$ is filled with the
positions $\vec{x}$ where the error function of all higher
ranked objects exceeds the average block match error
by a fixed factor. A correct selection of $X_l$ is
necessary to prevent the current estimator from
estimating motion that is already covered by higher-
ranked parameter estimators.

• Reducing the effect of locations within $X_l$ that are
potentially better covered by objects ranked lower in
the hierarchy: assigning higher weights $W_l(\vec{x})$ to the
pixels assigned to object l in the previous segmentation.
The location dependent weighting factor $W_l(\vec{x})$ is
determined by the segmentation mask M(B, n-1) (see
Section 2.4) found in the previous image. Positions $\vec{x}$
that belong to the current object l according to the
segmentation mask will have a weighting factor greater
than one, whereas positions belonging to a different
object have a weighting factor of one. A correct
selection of $W_l(\vec{x})$ is necessary to prevent the current
estimator from estimating motion that can be covered
by lower-ranked parameter estimators.



Figure 2 gives an example of how the points of interest are
selected in an image with two moving objects. The moving
background of the image is object l = 1, and there is a
smaller object l = 2. Set $X_1$ will be the entire image
(everything not sufficiently covered by l = 0), and $X_2$
covers the part of the image that is not sufficiently covered
by l = 0 and l = 1. In the part of the image that was covered
by object l = 1 in the previous segmentation, $W_1(\vec{x}) > 1$
results, with $W_1(\vec{x}) = 1$ in the section covered by other
objects. In the section covered by object l = 2, $W_2(\vec{x}) > 1$
results. Since $X_2$ does not extend beyond the area of object
l = 2, there is no section where $W_2(\vec{x})$ equals one.
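More formally, the cost function is calculated according to:

$$\epsilon'(\vec{CP}_l(n)) = \epsilon(\vec{CP}_l(n)) + \sum_{\vec{x} \in X_l} W_l(\vec{x}) \cdot \pi(\vec{CP}_l(n)) \tag{5}$$

where penalties $\pi(\vec{CP}_l(n))$ are added to the match error
of individual candidate vectors (parameter sets) to obtain
temporal smoothness, and ε is:

$$\epsilon(\vec{CP}_l(n)) = \sum_{\vec{x} \in X_l} W_l(\vec{x}) \cdot \left| F_s(\vec{x}, n) - F_s(\vec{x} - \vec{D}(\vec{x}, l, n), n-1) \right| \tag{6}$$

The block match error used for the segmentation of Section 2.4 is:

$$\epsilon(B, l, n) = \sum_{\vec{x} \in B} \left| F_s(\vec{x} + (1-\alpha) \vec{D}(\vec{x}, l, n), n) - F_s(\vec{x} - \alpha \vec{D}(\vec{x}, l, n), n-1) \right| \tag{7}$$

Here $F_s(\vec{x}, n)$ is the luminance value at position $\vec{x}$ in a
sub-sampled image with index n.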

The sub-sampling effectively reduces the required memory
bandwidth. Images are sub-sampled with a factor of 4
horizontally and 2 vertically on a field base, generating a
sub-sampled image $F_s(n)$ from each original field F(n).
To achieve pixel accuracy on the original pixel grid of F,
interpolation is required on the sub-sampling grid.
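
The cost calculation of eqs. (5) and (6) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation on the VLIW core: images are lists of rows, displacements are whole pixels on the sub-sampled grid, positions outside the image are clamped to the border, and the penalty is scaled by the total weight of the points of interest, which is one reading of eq. (5). All names are hypothetical.

```python
def match_error(curr, prev, points, weights, d):
    """Weighted sum of absolute differences over the points of interest, eq. (6).

    curr, prev: sub-sampled images n and n-1 as lists of rows.
    points:     list of (x, y) points of interest.
    weights:    dict mapping (x, y) to its weighting factor.
    d:          integer displacement (dx, dy) generated by the candidate.
    """
    dx, dy = d
    height, width = len(prev), len(prev[0])
    total = 0
    for (x, y) in points:
        # ASSUMPTION: clamp displaced positions to the image border.
        px = min(max(x - dx, 0), width - 1)
        py = min(max(y - dy, 0), height - 1)
        total += weights[(x, y)] * abs(curr[y][x] - prev[py][px])
    return total


def cost(curr, prev, points, weights, d, penalty):
    """Match error plus a penalty scaled by the total weight, eq. (5)."""
    total_weight = sum(weights[p] for p in points)
    return match_error(curr, prev, points, weights, d) + total_weight * penalty
```

Scaling the penalty by the total weight keeps it comparable to the match error regardless of how many points of interest an estimator has.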




Fig. 3 Picture and resulting segmentation. 



2.4 Segmentation 

Segmentation is the most critical step in the algorithm. Its
task is to assign one motion model to each group of pixels.
This is basically achieved by assigning the best matching
model to each group of pixels (a block B, which is
typically as large as 8 x 8 pixels on frame base).

For each block, a match error ε(B, l, n) is calculated
according to eq. (7).

Segmentation mask M(B, n) assigns the object l with the
lowest ε to the block B. The temporal position of the
segmentation is defined by α, which was set to 1/2 in our
experiments.

To save processing power, the segmentation mask M does
not need to be calculated for every block B. Instead, the
calculated blocks can be sub-sampled in a quincunx
pattern, after which the missing positions in the
segmentation mask are interpolated (for example by
choosing the most frequently occurring object number
from a neighbourhood) [11].

Segmentation is more difficult when many objects occur in 
the image, since the segmentation task will resemble more 
and more that of a full search block matcher. Extra 
(smoothing) measures have been added to prevent an 
output from the motion estimator having inconsistencies 
similar to those of a full search block matcher. 

These measures are:

• Overlapping blocks: by taking a larger window when
calculating the ε than the size of the block B to which
the object is assigned (spatial smoothing).

• Bonus system: by reducing the calculated ε of an object
using a bonus value if this object was chosen in the
segmentation of the previous image or in spatially
neighbouring blocks (temporal and spatial smoothing).
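
The assignment step with a bonus can be sketched in Python as below. The sketch is deliberately reduced: it takes the block match errors as given, applies only the temporal bonus (the spatial bonus and the overlapping windows are omitted), and the bonus value and all names are hypothetical.

```python
def segment(errors, previous_mask, bonus):
    """Assign to each block the object with the lowest (bonus-adjusted) error.

    errors:        errors[b][l] is the match error of object l in block b.
    previous_mask: previous_mask[b] is the object chosen for block b in the
                   previous image.
    bonus:         value subtracted from the error of the previously chosen
                   object (ASSUMPTION: a fixed, hand-picked value).
    """
    mask = []
    for b, block_errors in enumerate(errors):
        adjusted = [
            e - bonus if previous_mask[b] == l else e
            for l, e in enumerate(block_errors)
        ]
        mask.append(min(range(len(adjusted)), key=adjusted.__getitem__))
    return mask
```

With errors of 10 and 12 for objects 0 and 1, a bonus of 3 keeps a block on object 1 if that object was chosen in the previous image, whereas without the bonus the block would switch to object 0. This is the temporal smoothing effect described above.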

Figure 3 gives an example of a segmentation according to 
the object-based motion estimation method, with the 
original luminance image. 



3 FILM-MODE RECOGNITION 

The movie detector that we described in [22] recognizes
movie and video formats using motion vectors integrated
over a field period of the video signal. The current detector
recognizes both 2-3 and 2-2 pull-down. Furthermore,
since the parameter estimators already describe motion for
a large portion of the image, vectors no longer need to be
integrated. We found that a reliable movie detector can be
realized by analysing the motion in the highest ranked
estimator only, disregarding the zero-vector 'estimator'.

Let us define max(n) as the largest component of parameter
vector $\vec{P}_1(n)$, i.e.

$$\max(n) = \max\{ s_x(1, n),\; s_y(1, n) \} \tag{8}$$




Square Error (M²SE) of the motion vector field than the
reference block-based motion estimator (Table 2).

Table 2 Relative performance of the motion estimator

Modified Mean Square Error

Scene          Block-based   Object-based   (Object-based/Block-based) (%)
Bond 1              40            46            115
Bond 2             124           129            104
Girl & Fence       100           120            120
PRL Car             81           108            133
Renata              38           174            458
Subtext             44            98            223
Total              427           675            158



M²SE is a measurement of the true motion quality of a
vector field [11]. The M²SE is defined as:

$$M^2SE(n) = \frac{1}{P_{MW}} \sum_{\vec{x} \in MW} \left[ F(\vec{x}, n) - F_{mc}(\vec{x}, n) \right]^2 \tag{18}$$

where $F(\vec{x}, n)$ is a frame of the test sequence for which
$F_{mc}(\vec{x}, n)$ is also calculated. $P_{MW}$ is the number of pixels in
the measurement window MW that corresponds to the
entire image, excluding a margin of M x N at the edges,
where M and N define the vector range of the estimator
(Table 1). The interpolated picture $F_{mc}(\vec{x}, n)$ is a motion
compensated average resulting from shifted pictures at n - 1
and n + 1:

$$F_{mc}(\vec{x}, n) = \frac{1}{2} \left[ F(\vec{x} - \vec{D}(\vec{x}, n), n-1) + F(\vec{x} + \vec{D}(\vec{x}, n), n+1) \right] \tag{19}$$

For more background on the M²SE, see [24]. The average
M²SE over a 20 frame sequence is calculated to improve
reliability.
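
The measurement can be made concrete with a short Python sketch, assuming that the interpolated picture is built from the pictures at n - 1 and n + 1 as stated in the text. Pictures are lists of rows, the vector field holds integer displacements per pixel, and the margins are assumed to be at least as large as the largest displacement so that all accesses stay inside the image. The function name and argument layout are hypothetical.

```python
def m2se(prev, curr, nxt, vectors, margin_x, margin_y):
    """Mean square error between picture n and its motion compensated
    average from pictures n-1 and n+1, eqs. (18) and (19).

    vectors[y][x] is the integer displacement (dx, dy) at position (x, y).
    ASSUMPTION: |dx| <= margin_x and |dy| <= margin_y everywhere.
    """
    height, width = len(curr), len(curr[0])
    total, count = 0.0, 0
    for y in range(margin_y, height - margin_y):
        for x in range(margin_x, width - margin_x):
            dx, dy = vectors[y][x]
            f_mc = 0.5 * (prev[y - dy][x - dx] + nxt[y + dy][x + dx])
            total += (curr[y][x] - f_mc) ** 2
            count += 1
    return total / count
```

For a pattern that translates by one pixel per picture, the true vector field gives an M²SE of zero, while the zero vector field leaves a residual error. That is why a low score indicates vectors that follow the true motion rather than merely a good block match.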

Table 2 shows that scenes with only camera panning
(Bond 1) or a few fairly large moving objects (Bond 2, Girl
& Fence, PRL Car) are only slightly degraded compared
with scenes that can be considered difficult for the object-
based algorithm, such as scenes with objects having almost
the same velocities (Renata) or with small moving objects
(Subtext).

Subjectively, the results on the first four sequences
(Bond 1 through PRL Car) are perceptually acceptable.
The perceptual performance on the last two sequences
(Renata and Subtext) is unacceptable compared to the
reference block-based algorithm. Future research will
focus on improving this performance.

5.2 Resource usage

Table 3 shows the CPU load for the various software 
modules, and the total usage for the various formats. 
Formats that require more temporal interpolation clearly 
require more processing power (2-3 vs. 2-2 pull-down 
movie). As can be seen, the load of the parameter 
estimation and segmentation is not high compared to that 
of the scan conversion modules. This is partly achieved by 
operating the estimator and segmentation modules on 
down-scaled images as discussed in Section 2. The load of 
the segmentation depends on the film mode, because the 
segmentation operates on the real, received picture rate. 



Table 3 Resources claimed by the software.

CPU USAGE (REAL-TIME OPERATION)¹

Mode                   Video (%)   2-3 movie (%)   2-2 movie (%)
Decimation                 5             5               5
Parameter estimation       6             5               4
Movie detection            0             0               0
Segmentation              28            11              14
Scan conversion Y         36            40              31
Scan conversion C          9             0               0
Total                     84            61              54

1. Image size: CCIR601; processor: TM1000 @ 132 MHz.



Compared with the block-based algorithm described in
[1, 11], the object-based approach saves roughly a factor of
five in calculating the match errors. This can be concluded
from the following global calculations.

The block-based approach calculates the match error for
eight candidates on full resolution images with a factor of
four sub-sampling on the match error calculation:

$$8 \text{ candidates} \times \frac{720 \times 480}{4} \div (720 \times 480) = 2 \text{ match errors/pixel.}$$

The object-based approach calculates the match error for a
maximum of three candidates on a factor eight down-
scaled image with all pixels included in match error
calculation:

$$3 \text{ candidates} \times \frac{720 \times 480}{8} \div (720 \times 480) \approx 0.4 \text{ match errors/pixel.}$$

This gives a reasonable indication, because the
segmentation covers more than 70 % of the resources that
are occupied by the motion estimator (see Table 3 under
the video column).
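
The arithmetic behind these figures can be checked with a few lines of Python. The only assumption is which modules count as "the motion estimator" for the 70 % figure: here decimation, parameter estimation and segmentation from the video column of Table 3 are taken together.

```python
PIXELS = 720 * 480

# Match errors per pixel of the original image.
block_based = 8 * (PIXELS / 4) / PIXELS    # 8 candidates, 1-in-4 sub-sampling
object_based = 3 * (PIXELS / 8) / PIXELS   # 3 candidates, 8x down-scaled image
saving = block_based / object_based

# Share of segmentation in the motion estimation load (video column).
# ASSUMPTION: motion estimator = decimation + parameter estimation
# + segmentation.
decimation, parameter_estimation, segmentation = 5, 6, 28
segmentation_share = segmentation / (
    decimation + parameter_estimation + segmentation
)

print(block_based, object_based, round(saving, 2), round(segmentation_share, 2))
```

The exact values are 2 and 0.375 match errors per pixel, a ratio of about 5.3, which is the "roughly a factor of five" quoted above.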




Schutten and de Haan: Real-Ti 
Motion Estimation/Compensati 



Pull-Down Elimination Applying 
a Programmable Device 



6 CONCLUSION 

Recent progress in the fields of ME, MC and CPU-
architectures has resulted in a software package running in
real time on a commercially available processor, realizing
judder-free motion of film material on TV and PC displays.
Automatic adaptation of the processing to movie and video
is included in the software, which supports many input and
output scanning formats. This concept has proven useful
before [1]. The ME part of the design, however, is
completely different. Rather than estimating vectors for
blocks, we used a newly designed object-based ME
method.

The new object-based motion estimator scores on average
60 % worse based on Modified Mean Square Error (M²SE)
than a block-based algorithm. There is, however, a clear
distinction between scenes that are on average only 20 %
worse and scenes that are more than double the M²SE. The
scenes with only 20 % worse M²SE are perceptually
acceptable, while the scenes with more than double the
M²SE are perceptually unacceptable. Future research is
required to improve on this performance.
The new object-based motion estimator saves roughly a
factor five in resource usage compared to the block-based
algorithm based on match error calculations.
A maximum of three different objects can be distinguished
in the image. In this context, an "object" means one or more
image sections that can be described by the same motion
model; it does not necessarily correspond to a single
physical object in the scene.

The software also includes a movie detector to allow 
automatic adaptation to film material. Similar to the movie 
detector of [22], it recognizes movie and video formats 
using motion vectors integrated over a field period of the 
video signal. The detector recognizes 2-3 pull-down as 
well as the 2-2 pull-down detector of [22]. 
The software supports conversion to progressive and
interlaced TV formats, and to VGA for PCs. De-interlacing
is therefore required, for which we applied field merging
for movie material, and a vertical-temporal median filter
[23] for video camera signals.